Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
diamonds %>%
gather(key = dist, vals, x, y, z) %>%
ggplot(aes(vals, colour = dist)) +
geom_freqpoly(bins = 100)
One thing that is obvious in hindsight but perhaps hard to grasp at first is that the distributions of x and y are pretty much the same. In fact, the same graph as above with bins = 30 won't show you the x distribution at all because it overlaps the y distribution almost perfectly. The correlation between the two is `cor(diamonds$x, diamonds$y)`.
If we round each measurement to the nearest millimetre, `mean(with(diamonds, round(x, 0) == round(y, 0)))` of the x–y pairs end up with the same value. In other words, the length x is roughly proportional to the width y.
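For reference, the two quantities cited inline above can be computed directly (this is the same code, just evaluated in its own chunk):

```r
library(ggplot2) # for the diamonds dataset

# Correlation between length (x) and width (y)
cor(diamonds$x, diamonds$y)

# Share of diamonds whose x and y agree after rounding to the nearest mm
mean(with(diamonds, round(x, 0) == round(y, 0)))
```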
library(GGally) # ggpairs() comes from the GGally package

diamonds %>%
filter(y < 30) %>%
select(x, y, z) %>%
ggpairs()
Yet the relationship of x and y with z is almost flat, as expected — that is, after excluding the two diamonds that had unreasonable values.
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
## TODO: Fix the Y and X axis to be able to specify the cutting point in the distribution.
source("http://peterhaschke.com/Code/multiplot.R")
graph <- map(seq(50, 1000, 100),
~ ggplot(diamonds, aes(x = price)) +
geom_histogram(bins = .x) +
labs(x = NULL, y = NULL) +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL))
multiplot(plotlist = graph)
## Loading required package: grid
The distribution decays as expected, but there's a gap in it: almost no diamonds are priced around a certain threshold (roughly $1,500), so most prices fall clearly above or below it.
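To pin down where that threshold sits, one option (just a sketch: zooming in on the low end with a fine binwidth) is:

```r
library(tidyverse)

# Fine-grained look at the low end of the price range; the empty stretch
# of bars marks the gap in the distribution
p <- diamonds %>%
  filter(price < 2500) %>%
  ggplot(aes(price)) +
  geom_histogram(binwidth = 25)
p
```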
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>%
filter(carat %in% c(0.99, 1)) %>%
count(carat)
I have no idea. It could be that 0.99 is just a typo repeated 23 times.
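One possibility worth checking: sellers may round borderline stones up to the more marketable 1 carat. If so, we'd expect far more 1-carat diamonds than 0.99-carat ones, and possibly a price premium at the round number. A quick comparison (a sketch, not a definitive test):

```r
library(tidyverse)

# Counts and average prices at 0.99 vs 1 carat
counts <- diamonds %>%
  filter(carat %in% c(0.99, 1)) %>%
  group_by(carat) %>%
  summarise(n = n(), mean_price = mean(price))
counts
```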
Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
diamonds %>%
ggplot(aes(y)) +
geom_histogram() +
coord_cartesian(ylim = c(0, 50))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Note how xlim() drops the observations outside its limits instead of just zooming.
diamonds %>%
ggplot(aes(y)) +
geom_histogram() +
xlim(c(0, 60)) +
coord_cartesian(ylim = c(0, 50))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Also note how xlim and ylim inside coord_cartesian don't exclude the data
diamonds %>%
ggplot(aes(y)) +
geom_histogram(bins = 30) +
coord_cartesian(xlim = c(2, 60), ylim = c(0, 50))
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
diamonds %>%
ggplot(aes(price)) +
geom_histogram(bins = 1000)
In a histogram, missing values are removed before binning (ggplot2 warns about removed non-finite values), since an NA cannot be placed anywhere on a continuous axis.
In a bar chart of a categorical variable, NA would be shown as its own category with its own bar; with a continuous variable such as cyl below, stat_count removes the NA just like the histogram does.
mtcars[1, 2] <- NA
mtcars %>%
ggplot(aes(cyl)) +
geom_bar()
## Warning: Removed 1 rows containing non-finite values (stat_count).
What does na.rm = TRUE do in mean() and sum()?
It removes the NA from the calculations.
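A minimal illustration:

```r
x <- c(1, 2, NA, 4)

mean(x)               # NA: by default a single missing value propagates
mean(x, na.rm = TRUE) # the NA is dropped first, so this is the mean of 1, 2, 4
sum(x, na.rm = TRUE)  # likewise: 1 + 2 + 4 = 7
```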
7.5.1.1 Exercises
Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
fl <-
flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
)
fl %>%
ggplot(aes(sched_dep_time, ..density.., colour = cancelled)) +
geom_freqpoly(binwidth = 1/2)
fl %>%
ggplot(aes(sched_dep_time, colour = cancelled)) +
geom_density()
fl %>%
ggplot(aes(cancelled, sched_dep_time)) +
geom_boxplot()
What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
library(arm) # display() comes from the arm package

display(lm(price ~ ., diamonds), detail = TRUE)
## lm(formula = price ~ ., data = diamonds)
## coef.est coef.se t value Pr(>|t|)
## (Intercept) 5753.76 396.63 14.51 0.00
## carat 11256.98 48.63 231.49 0.00
## cut.L 584.46 22.48 26.00 0.00
## cut.Q -301.91 17.99 -16.78 0.00
## cut.C 148.03 15.48 9.56 0.00
## cut^4 -20.79 12.38 -1.68 0.09
## color.L -1952.16 17.34 -112.57 0.00
## color.Q -672.05 15.78 -42.60 0.00
## color.C -165.28 14.72 -11.22 0.00
## color^4 38.20 13.53 2.82 0.00
## color^5 -95.79 12.78 -7.50 0.00
## color^6 -48.47 11.61 -4.17 0.00
## clarity.L 4097.43 30.26 135.41 0.00
## clarity.Q -1925.00 28.23 -68.20 0.00
## clarity.C 982.20 24.15 40.67 0.00
## clarity^4 -364.92 19.29 -18.92 0.00
## clarity^5 233.56 15.75 14.83 0.00
## clarity^6 6.88 13.72 0.50 0.62
## clarity^7 90.64 12.10 7.49 0.00
## depth -63.81 4.53 -14.07 0.00
## table -26.47 2.91 -9.09 0.00
## x -1008.26 32.90 -30.65 0.00
## y 9.61 19.33 0.50 0.62
## z -50.12 33.49 -1.50 0.13
## ---
## n = 53940, k = 24
## residual sd = 1130.09, R-Squared = 0.92
# A quick-and-dirty answer: carat
# Let's confirm the variation in carat for cut.
diamonds %>%
ggplot(aes(cut, carat)) +
geom_boxplot()
# It looks like it's weakly negatively correlated, so the fair diamonds having the greater carat.
diamonds %>%
ggplot(aes(carat, colour = cut)) +
geom_density()
# It does look like Fair diamonds have the highest average carat, but only by a little.
diamonds %>%
group_by(cut) %>%
summarise(cor(carat, price))
Carat and price do look highly correlated, both across and within the quality levels of the diamonds.
Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
library(ggstance)
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
diamonds %>%
ggplot(aes(cut, carat)) +
geom_boxplot() +
coord_flip()
diamonds %>%
ggplot(aes(carat, cut)) +
geom_boxploth()
It's exactly the same plot, but less verbose with geom_boxploth(). Note that because geom_boxploth() is already flipped, the variable order changes as well: the continuous variable goes on the x axis and the categorical one on the y axis.
One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
library(lvplot)
p <- ggplot(diamonds, aes(cut, price, colour = ..LV..))
p + geom_lv()
p <- ggplot(diamonds, aes(cut, carat, fill = ..LV..))
p + geom_lv()
This plot is useful for a more detailed description of the tails of a distribution. It works because each letter-value box has both height and width, so, for example, we can see that the upper tail for Fair has more values than the upper tail for Ideal. Along the same lines, the distribution for Ideal narrows, both in carat and in the number of outliers, as it moves towards the upper tail. That information is very difficult to read off a boxplot.
Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
diamonds %>%
ggplot(aes(cut, price)) +
geom_violin()
diamonds %>%
ggplot(aes(price)) +
geom_histogram() +
facet_wrap(~ cut, scales = "free_y", nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
diamonds %>%
ggplot(aes(price)) +
geom_freqpoly(aes(colour = cut))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The violin plot is, at least to me, extremely useful for comparing distributions. Histograms are trickier to compare, although they become more useful when the y axis is allowed to vary across facets. Frequency polygons can be misleading because the frequency of each category strongly influences the visual display. In both cases we'd have to adjust for it, by freeing the y axis (histogram) or by plotting ..density.. on the y axis (freqpoly).
If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
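As far as I can tell, the two main geoms are geom_quasirandom() and geom_beeswarm(). geom_quasirandom() offsets points with a quasirandom sequence so that the width of the cloud reflects the local density, while geom_beeswarm() places points deterministically so that none overlap, producing a "swarm". A sketch (using mpg only as a convenient small-ish dataset):

```r
library(ggplot2)
library(ggbeeswarm)

# geom_quasirandom(): density-aware jitter
p_quasi <- ggplot(mpg, aes(drv, hwy)) +
  geom_quasirandom()
p_quasi

# geom_beeswarm(): deterministic, non-overlapping point placement
p_bee <- ggplot(mpg, aes(drv, hwy)) +
  geom_beeswarm()
p_bee
```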
How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
By calculating percentages within each colour (and, optionally, also displaying the n).
diamonds %>%
count(color, cut) %>%
group_by(color) %>%
mutate(perc = n / sum(n)) %>%
ggplot(aes(color, cut, fill = perc)) +
geom_tile()
Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
One thing that makes it extremely difficult to read is that the highest values of dep_delay drive the whole colour palette upwards, so smaller differences become invisible. Also, many dest have missing values in some months. Two fixes: exclude dest with missing values for now, and summarise, standardise or rescale dep_delay so that we can spot differences.
library(viridis)
## Loading required package: viridisLite
library(forcats)
flights %>%
ggplot(aes(x = month, y = dest, fill = dep_delay)) +
geom_tile()
flights %>%
mutate(tot_delay = dep_delay + arr_delay) %>%
filter(tot_delay > 0) %>%
group_by(dest, month) %>%
summarize(dep_del_dev = mean(tot_delay, na.rm = T)) %>%
filter(n() == 12) %>%
ungroup() %>%
ggplot(aes(x = factor(month), y = fct_reorder(dest, dep_del_dev), fill = dep_del_dev)) +
geom_tile() +
scale_fill_viridis()
Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
diamonds %>%
count(color, cut) %>%
ggplot(aes(x = color, y = cut)) +
geom_tile(aes(fill = n))
Because cut is ordered, putting it on the y axis gives the plot a scatterplot-like reading. Also, it's better to have the labels we need to read repeatedly (and which are a bit lengthy) on the y axis.
Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
ggplot(data = diamonds,
mapping = aes(x = price,
colour = cut_width(carat, 0.3))) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cut_number() makes N groups with the same number of observations in each group; here there's no contextual or substantive reason behind that particular grouping.
cut_width() is more substantive: you make groups based on the metric of the variable itself.
In fact, there's a tension between the two. The idea is to reach an equilibrium: categories that have substantive meaning but also reasonable sample sizes.
Here we get 10 groups with the same sample size.
ggplot(data = diamonds, aes(x = cut_number(price, 10), y = carat)) +
geom_boxplot()
The relationship looks a bit exponential, with little difference between the smaller groups.
But with 10 groups based on substantive metrics:
ggplot(data = diamonds, aes(x = cut_width(price, 2000), y = carat)) +
geom_boxplot()
There are bigger differences between the bottom groups. That's because the bottom groups now contain many more observations, which captures the variability between groups; within each of these wide groups the variance is small, which is why the previous plot showed very little difference among the small groups.
Visualize the distribution of carat, partitioned by price.
ggplot(diamonds, aes(carat, y = ..density.., colour = cut_width(price, 2000))) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
diamonds %>%
filter(between(carat, 0, 2.5)) %>%
mutate(carat = cut_width(carat, 1)) %>%
ggplot(aes(price)) +
geom_histogram() +
facet_wrap(~ carat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I'm a bit surprised by the distribution of price for bigger diamonds: I was expecting very little variance! It seems that big diamonds can cost anything between $5,000 and $18,000, whereas small ones have very little variance.
Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat and price.
diamonds %>%
filter(between(carat, 0, 2.5)) %>%
mutate(carat = cut_width(carat, 1)) %>%
ggplot(aes(cut, price)) +
geom_boxplot() +
scale_y_log10() +
facet_wrap(~ carat)
We get the differences in price we saw before across carat groups, but we also see that within each carat group the price differences between cuts are very small. It seems that the driving force behind price is carat rather than cut.
Why is a scatterplot a better display than a binned plot for this case?
ggplot(diamonds, aes(x, y)) +
geom_point() +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
Because binned plots tend to categorize continuous measures and might distort some relationships, like a linear one here.
ggplot(diamonds, aes(x, y)) +
geom_hex()
Another possibility would be to bin both continuous variables, as stated before, but that would categorize outliers into groups, eliminating any trace of anomalous or extreme values.